This report explores a data set containing quality scores and attributes for approximately 1600 wines.
The data set has 1599 wine records, with 12 variables for each record. According to the summary of the dataset, column X is just the row number and carries no information, so it is omitted. Wine quality is scored on a scale from 0 to 10, so it is converted to an ordered factor.
## [1] 1599 12
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
## 3 4 5 6 7 8
## 10 53 681 638 199 18
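The preprocessing described above (dropping the row-number column and turning quality into an ordered factor) could look roughly like the sketch below. It is not the report's original code: the data frame `raw` is a hypothetical five-row stand-in built from rows printed later in this report, and `wine_demo` is an illustrative name.

```r
# Hypothetical stand-in for the raw file: X holds the row number,
# alcohol and quality are copied from rows 15, 16, 585, 926 and 927.
raw <- data.frame(
  X       = c(15, 16, 585, 926, 927),
  alcohol = c(9.2, 9.2, 10.7, 11.0, 10.2),
  quality = c(5L, 5L, 7L, 7L, 6L)
)

# Drop the row-number column, which carries no information.
wine_demo <- raw[, names(raw) != "X"]

# Convert the integer score to an ordered factor.
wine_demo$quality <- factor(wine_demo$quality, ordered = TRUE)

dim(wine_demo)             # rows and remaining columns
levels(wine_demo$quality)  # only the scores that actually occur
table(wine_demo$quality)   # counts per score
```

Only the scores present in the data become factor levels, which is why the full data set shows levels 3 to 8 rather than 0 to 10.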
Quality is the focus of this report, so I start with the quality counts. Most of the wines scored 5 or 6; only 10 scored 3, and 18 scored 8. The lowest observed score is 3 and the highest is 8, although the scale goes up to 10, so in this analysis I treat 3 as the worst quality, 8 as good wine, and 5 and 6 as average.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The fixed.acidity distribution is slightly right-skewed, with most values between about 7 and 9 g/dm^3 (median 7.9).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity in the wine is generally much lower than fixed acidity. Too high a level of volatile acidity gives an unpleasant taste. As shown in this plot, most wines have volatile acidity around 0.52 g/dm^3; the maximum, 1.58 g/dm^3, is an outlier.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid is mostly distributed between 0 and 0.75 g/dm^3. The distribution is almost uniform over that range, with many wines containing no citric acid at all.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The middle half of the residual sugar values fall between 1.9 and 2.6 g/dm^3, with a median of 2.2 and a mean of about 2.5 g/dm^3. According to the plots, the outliers have more than 8.0 g/dm^3 of residual sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides are roughly normally distributed around the median of 0.079 g/dm^3, with a long right tail: some wines have more than 0.2 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The distribution of free sulfur dioxide is skewed to the right, so I transform the x axis with ‘sqrt’.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
After transforming the x axis, the distribution is clearer. Most of the free sulfur dioxide values are in the range of 7 to 40 mg/dm^3. I wonder whether sulfur dioxide (SO2) affects the taste of the wine.
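To make the effect of the square-root transform concrete, the sketch below compares the upper and lower tails of free sulfur dioxide using only the five-number summary reported above. The helper `tail_ratio` is a hypothetical name introduced for illustration and was not part of the original analysis; a ratio of 1 would mean the two tails are equally long.

```r
# Five-number summary of free sulfur dioxide, as reported above.
fsd <- c(min = 1, q1 = 7, median = 14, q3 = 21, max = 72)

# Length of the upper tail divided by the length of the lower tail.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

tail_ratio(fsd)        # raw scale: 58 / 13, about 4.5
tail_ratio(sqrt(fsd))  # sqrt scale: about 1.7
```

The upper tail is still longer after the transform, but far less so, which matches the more readable histogram.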
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1080 7.9 0.3 0.68 8.3 0.05
## 1082 7.9 0.3 0.68 8.3 0.05
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1080 37.5 278 0.99316 3.01 0.51
## 1082 37.5 289 0.99316 3.01 0.51
## alcohol quality
## 1080 12.3 7
## 1082 12.3 7
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 110 8.1 0.785 0.52 2.0 0.122
## 355 6.1 0.210 0.40 1.4 0.066
## 516 8.5 0.655 0.49 6.1 0.122
## 652 9.8 0.880 0.25 2.5 0.104
## 673 9.8 1.240 0.34 2.0 0.079
## 685 9.8 0.980 0.32 2.3 0.078
## 1245 5.9 0.290 0.25 13.4 0.067
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 110 37.0 153 0.99690 3.21 0.69
## 355 40.5 165 0.99120 3.25 0.59
## 516 34.0 151 1.00100 3.31 1.14
## 652 35.0 155 1.00100 3.41 0.67
## 673 32.0 151 0.99800 3.15 0.53
## 685 35.0 152 0.99800 3.25 0.48
## 1245 72.0 160 0.99721 3.33 0.54
## alcohol quality
## 110 9.3 5
## 355 11.9 6
## 516 9.3 5
## 652 11.2 5
## 673 9.5 5
## 685 9.4 5
## 1245 10.3 6
The distribution of free sulfur dioxide is skewed to the right, with most values around 14.00 mg/dm^3, so I transformed the x axis with log10(). After the transformation, the distribution looks approximately normal.
There are two outliers with total sulfur dioxide above 200 and 7 more samples above 150. All of them have quality 5 to 7, which suggests that wines with high total sulfur dioxide may taste average or better. However, the sample is too small to draw any conclusion at this point.
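The counts quoted above can be reproduced from the nine rows printed earlier. This is a minimal sketch: the data frame `high` is an illustrative reconstruction of those rows, not an object from the original analysis.

```r
# total.sulfur.dioxide and quality for the nine high-SO2 rows printed above.
high <- data.frame(
  row                  = c(1080, 1082, 110, 355, 516, 652, 673, 685, 1245),
  total.sulfur.dioxide = c(278, 289, 153, 165, 151, 155, 151, 152, 160),
  quality              = c(7, 7, 5, 6, 5, 5, 5, 5, 6)
)

# Split the rows into the two groups discussed in the text.
extreme  <- subset(high, total.sulfur.dioxide > 200)
moderate <- subset(high, total.sulfur.dioxide > 150 & total.sulfur.dioxide <= 200)

nrow(extreme)        # 2
nrow(moderate)       # 7
table(high$quality)  # five wines at 5, two at 6, two at 7
```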
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02273 0.25930 0.37500 0.38230 0.48480 0.85710
The ratio of free sulfur dioxide to total sulfur dioxide is approximately normally distributed. The middle half of the wines have a ratio between 0.259 and 0.485.
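The ratio is simply free sulfur dioxide divided by total sulfur dioxide. As a quick check, the sketch below recomputes it for three rows (15, 585 and 1155) whose values and `slf_rate` are printed later in this report; the vector names `free` and `total` are illustrative.

```r
# free and total sulfur dioxide for rows 15, 585 and 1155
free  <- c(52, 54, 50)
total <- c(145, 80, 63)

# Element-wise ratio, matching the slf_rate column in the listing.
slf_rate <- free / total
round(slf_rate, 7)  # 0.3586207 0.6750000 0.7936508
```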
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The density of the wine is approximately normally distributed, centered around the median of 0.9968 g/cm^3. Very few wines have a density higher than pure water (1 g/cm^3).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The middle half of the wines in this dataset have a pH between 3.21 and 3.40.
Sulphates are an additive that acts as an antimicrobial and antioxidant in the wine. Most wines have sulphates between 0.5 and 0.8 g/dm^3. The distribution is slightly skewed to the right.
The alcohol content is slightly skewed to the right. No wine has less than 8.4% alcohol, and most wines are below 13%.
The data set has 1599 wine records, with 12 variables for each record. The variable quality is an ordered factor with levels from 3 to 8 (a higher level means better taste).
Other observations are:

- Most wines have a quality of 5 or 6
- Most wines have a density lower than that of water
To get an initial sense of the relationships between the variables, a grid plot was drawn. Abbreviations in the figure are as follows:

- f_acid: fixed acidity: g/dm^3
- v_acid: volatile acidity: g/dm^3
- c_acid: citric acid: g/dm^3
- r_sg: residual sugar: g/dm^3
- fsd: free sulfur dioxide: mg/dm^3
- tsd: total sulfur dioxide: mg/dm^3
- den: density: g/cm^3
- slf: sulphates: g/dm^3
- pH: pH
- alc: alcohol: %
- quality: quality (0-10)

According to the correlation coefficients, alcohol has the strongest correlation with quality, followed by volatile acidity, with correlation coefficients of 0.476 and -0.391 respectively. Sulphates and citric acid also show a weak correlation with quality.
In addition to these, citric acid is correlated with volatile acidity (-0.552), and density is correlated with alcohol (-0.496).
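Because quality is stored as an ordered factor, it has to be converted back to numbers before a correlation coefficient can be computed. The sketch below shows one way to do that on six rows taken from the listings in this report (rows 15, 58, 585, 926, 927 and 983). With so few rows the resulting coefficient will not match the 0.476 reported for the full data set; the point is only the conversion step.

```r
quality <- factor(c(5, 5, 7, 7, 6, 6), ordered = TRUE)
alcohol <- c(9.2, 9.4, 10.7, 11.0, 10.2, 12.9)

# as.integer() on a factor returns level codes, not the scores themselves.
as.integer(quality)                         # 1 1 3 3 2 2

# Going through character recovers the original scores.
score <- as.numeric(as.character(quality))  # 5 5 7 7 6 6

cor(alcohol, score)                       # Pearson correlation
cor(alcohol, score, method = "spearman")  # rank-based alternative for ordinal scores
```

Since quality is ordinal, the rank-based Spearman coefficient is a reasonable cross-check on the Pearson value.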
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.230 12.600
Fixed acidity varies within every quality level. Wines of quality 3 to 8 all cover a similar range of fixed acidity, indicating that fixed acidity does not appear to be correlated with quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
The relationship between volatile acidity and quality is clear. Lower-quality wines tend to have higher volatile acidity. The median volatile acidity of quality level 8 (0.370) is much lower than that of level 3 (0.845).
However, some high-quality wines have volatile acidity higher than the median value of level 3. Volatile acidity is a very important factor for the taste of the wine, but high volatile acidity does not mean that a wine will definitely taste bad.
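Per-quality summaries like the ones above have the shape that base R's `by()` prints. The following is a minimal sketch on eight rows taken from a listing later in this report (rows 15, 16, 58, 585, 926, 927, 983 and 1132); the vector names `va` and `q` are illustrative, and the sample is far too small to say anything about the trend itself.

```r
va <- c(0.62, 0.62, 0.63, 0.33, 0.22, 0.24, 0.52, 0.19)  # volatile acidity
q  <- factor(c(5, 5, 5, 7, 7, 6, 6, 5), ordered = TRUE)  # quality

# One summary() per quality level, printed group by group.
by(va, q, summary)

# Just the medians, as a named vector.
med <- tapply(va, q, median)
med  # 0.620 for level 5, 0.380 for level 6, 0.275 for level 7
```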
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Though wines with zero citric acid are found at quality levels 3 through 7 (the minimum at level 8 is 0.03), the plots show that wines with higher quality tend to have higher citric acid content, indicating that citric acid is weakly correlated with quality.
Volatile acidity is negatively correlated with citric acid. The more citric acid a wine has, the less volatile acidity it has.
However, the ratio of citric acid to volatile acidity doesn't show any trend with quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
Neither residual sugar nor chlorides appears to be correlated with quality.
slfr_p <- bi_jitter("quality", "slf_rate")
grid.arrange(slfr_p)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 15 8.9 0.620 0.18 3.8 0.176
## 16 8.9 0.620 0.19 3.9 0.170
## 58 7.5 0.630 0.12 5.1 0.111
## 397 6.6 0.735 0.02 7.9 0.122
## 401 6.6 0.735 0.02 7.9 0.122
## 585 11.8 0.330 0.49 3.4 0.093
## 926 8.6 0.220 0.36 1.9 0.064
## 927 9.4 0.240 0.33 2.3 0.061
## 983 7.3 0.520 0.32 2.1 0.070
## 1132 5.9 0.190 0.21 1.7 0.045
## 1155 6.6 0.580 0.00 2.2 0.100
## 1245 5.9 0.290 0.25 13.4 0.067
## 1296 6.6 0.630 0.00 4.3 0.093
## 1297 6.6 0.630 0.00 4.3 0.093
## 1359 7.4 0.640 0.17 5.4 0.168
## 1435 10.2 0.540 0.37 15.4 0.214
## 1436 10.2 0.540 0.37 15.4 0.214
## 1559 6.9 0.630 0.33 6.7 0.235
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 15 52 145.0 0.99860 3.16 0.88
## 16 51 148.0 0.99860 3.17 0.93
## 58 50 110.0 0.99830 3.26 0.77
## 397 68 124.0 0.99940 3.47 0.53
## 401 68 124.0 0.99940 3.47 0.53
## 585 54 80.0 1.00020 3.30 0.76
## 926 53 77.0 0.99604 3.47 0.87
## 927 52 73.0 0.99786 3.47 0.90
## 983 51 70.0 0.99418 3.34 0.82
## 1132 57 135.0 0.99341 3.32 0.44
## 1155 50 63.0 0.99544 3.59 0.68
## 1245 72 160.0 0.99721 3.33 0.54
## 1296 51 77.5 0.99558 3.20 0.45
## 1297 51 77.5 0.99558 3.20 0.45
## 1359 52 98.0 0.99736 3.28 0.50
## 1435 55 95.0 1.00369 3.18 0.77
## 1436 55 95.0 1.00369 3.18 0.77
## 1559 66 115.0 0.99787 3.22 0.56
## alcohol quality slf_rate
## 15 9.2 5 0.3586207
## 16 9.2 5 0.3445946
## 58 9.4 5 0.4545455
## 397 9.9 5 0.5483871
## 401 9.9 5 0.5483871
## 585 10.7 7 0.6750000
## 926 11.0 7 0.6883117
## 927 10.2 6 0.7123288
## 983 12.9 6 0.7285714
## 1132 9.5 5 0.4222222
## 1155 11.4 6 0.7936508
## 1245 10.3 6 0.4500000
## 1296 9.5 5 0.6580645
## 1297 9.5 5 0.6580645
## 1359 9.5 5 0.5306122
## 1435 9.0 6 0.5789474
## 1436 9.0 6 0.5789474
## 1559 9.5 5 0.5739130
Nothing stands out for free sulfur dioxide or total sulfur dioxide, nor for the ratio of free to total sulfur dioxide. According to theory, when free sulfur dioxide is 50 or more, the less total sulfur dioxide the wine has, the better its quality will be. However, only the 18 samples listed above have free sulfur dioxide that high, which is not enough to support the conclusion.
There is no obvious relationship between sulphates and free or total sulfur dioxide; however, sulphates are correlated with quality. Wines with higher sulphate content tend to get higher quality scores.
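The sample count depends on whether the threshold is inclusive. The sketch below uses the 18 free sulfur dioxide values from the listing above; the vector name `fsd_high` is illustrative.

```r
# free.sulfur.dioxide for the 18 rows listed above, in listing order
fsd_high <- c(52, 51, 50, 68, 68, 54, 53, 52, 51,
              57, 50, 72, 51, 51, 52, 55, 55, 66)

sum(fsd_high >= 50)  # 18: every listed row
sum(fsd_high > 50)   # 16: rows 58 and 1155 sit exactly at 50
```

Either way the group is very small compared with the 1599 wines in the data set.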
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9962 0.9976 0.9975 0.9988 1.0010
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9956 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0030
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0040
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0030
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.162 3.230 3.267 3.350 3.720
Neither density nor pH appears to be correlated with quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Alcohol shows the strongest correlation with quality. According to the plots, wines with higher alcohol tend to get higher quality scores. The median alcohol for level 8 (12.15) is much higher than that of level 3 (9.925).
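The size of that gap can be read straight off the grouped summaries. The sketch below copies the six group medians from the output above into a named vector (`alcohol_median` is an illustrative name) and looks at the step from each level to the next.

```r
# Median alcohol (%) per quality level, copied from the summaries above.
alcohol_median <- c(`3` = 9.925, `4` = 10.00, `5` = 9.70,
                    `6` = 10.50, `7` = 11.50, `8` = 12.15)

# Gap between the best and worst quality levels.
gap <- alcohol_median[["8"]] - alcohol_median[["3"]]
gap  # 2.225 percentage points

# Change from each level to the next; only the step from 4 to 5 is negative.
steps <- diff(alcohol_median)
steps
```

The increase is not perfectly monotone, but from level 5 upward the median rises at every step.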
Density is related to residual sugar and alcohol. Higher residual sugar goes with higher density, while higher alcohol goes with lower density.
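One way to look at both effects at once is a linear model of density on residual sugar and alcohol. The sketch below fits such a model on the 18 rows listed earlier (the wines with free sulfur dioxide of 50 or more); `demo` and `fit` are illustrative names, this model was not part of the original analysis, and coefficients from such a small, non-random subset should not be read as estimates for the whole data set.

```r
# residual.sugar, alcohol and density for the 18 rows listed earlier
demo <- data.frame(
  residual.sugar = c(3.8, 3.9, 5.1, 7.9, 7.9, 3.4, 1.9, 2.3, 2.1,
                     1.7, 2.2, 13.4, 4.3, 4.3, 5.4, 15.4, 15.4, 6.7),
  alcohol = c(9.2, 9.2, 9.4, 9.9, 9.9, 10.7, 11.0, 10.2, 12.9,
              9.5, 11.4, 10.3, 9.5, 9.5, 9.5, 9.0, 9.0, 9.5),
  density = c(0.99860, 0.99860, 0.99830, 0.99940, 0.99940, 1.00020,
              0.99604, 0.99786, 0.99418, 0.99341, 0.99544, 0.99721,
              0.99558, 0.99558, 0.99736, 1.00369, 1.00369, 0.99787)
)

# Ordinary least squares with both predictors.
fit <- lm(density ~ residual.sugar + alcohol, data = demo)
coef(fit)  # intercept plus one slope per predictor
```

On the full data set, the signs of the two slopes would be the thing to compare against the statement above.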
We can see that there are lots of wines with zero citric acid, so I extracted these samples to check their quality. Most of these wines score 5 or 6, which is not surprising. In the volatile acidity vs. quality plot, the trend that higher volatile acidity goes with lower quality is also clear, which indicates that volatile acidity is strongly correlated with quality.
Citric acid appears to be linearly correlated with volatile acidity. Wines with higher citric acid tend to have lower volatile acidity. However, the ratio of citric acid to volatile acidity doesn’t show any relationship with quality.
Overall, higher sulphates go with better taste; however, the difference in sulphate content between good and bad wine does not stand out.
Residual sugar and alcohol are both correlated with the density of the wine. However, alcohol differs more clearly across wine qualities.
Alcohol and volatile acidity are the factors most strongly correlated with quality; however, these two factors show no clear trend against each other. The plot also shows that most dark dots (high quality) sit in the high-alcohol, low-volatile-acidity area, while light dots (low quality) sit in the low-alcohol, high-volatile-acidity area.
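A scatter plot with this light-to-dark encoding can be sketched in base R as below. This is not the report's plotting code: it uses six rows from the listings in this report (rows 15, 58, 585, 926, 927 and 983), and the names `shades` and `point_col` are illustrative.

```r
alcohol          <- c(9.2, 9.4, 10.7, 11.0, 10.2, 12.9)
volatile.acidity <- c(0.62, 0.63, 0.33, 0.22, 0.24, 0.52)
quality          <- factor(c(5, 5, 7, 7, 6, 6), ordered = TRUE)

# One gray shade per quality level, running from light (low) to dark (high).
shades    <- gray.colors(nlevels(quality), start = 0.8, end = 0.1)
point_col <- shades[as.integer(quality)]

plot(alcohol, volatile.acidity, col = point_col, pch = 19,
     xlab = "alcohol (%)", ylab = "volatile acidity (g/dm^3)")
legend("topright", legend = levels(quality), col = shades,
       pch = 19, title = "quality")
```

Indexing `shades` by the factor's integer codes works because the codes follow the level order of the ordered factor.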
Most of the wines scored 5 or 6, while there are also a few low scores of 3 and high scores of 8.
Volatile acidity is strongly correlated with quality: wines with higher volatile acidity tend to have lower quality. The trend is also shown by the red line representing the median value of each quality group.
Alcohol and volatile acidity are the factors most strongly correlated with quality; however, these two factors show no clear trend against each other. The plot also shows that most dark dots (high quality) sit in the high-alcohol, low-volatile-acidity area, while light dots (low quality) sit in the low-alcohol, high-volatile-acidity area.
The data set has 1599 wine records, with 12 variables for each record. I started exploring the data by plotting the distributions, then investigated the relationships between the main features. Finally, I studied the three major attributes affecting the taste of the wine.
There is a clear trend between alcohol and quality, and between volatile acidity and quality.
This dataset also has some limitations. In theory, when free sulfur dioxide is 50 ppm or higher, high total sulfur dioxide affects the taste of the wine. The data hint at this; however, there are only 18 wine samples with free sulfur dioxide of 50 ppm or more, which is not enough to support the theory.